PonIC: Using Stratosphere to Speed Up Pig Analytics
Authors
Abstract
Pig, a high-level dataflow system built on top of Hadoop MapReduce, has greatly facilitated the implementation of data-intensive applications. By translating scripts into MapReduce jobs, Pig conceals Hadoop’s single-input restriction and inflexible two-stage pipeline. However, these limitations are still present in the backend, often resulting in inefficient execution. Stratosphere, a data-parallel computing framework consisting of PACT, an extension to the MapReduce programming model, and the Nephele execution engine, overcomes several limitations of Hadoop MapReduce. In this paper, we argue that Pig can benefit greatly from using Stratosphere as its backend system, gaining performance without any loss of expressiveness. We have ported Pig to run on top of Stratosphere, and we present a process for translating Pig Latin scripts into PACT programs. Our evaluation shows that Pig Latin scripts can execute on our prototype up to 8 times faster for a certain class of applications.
Related resources
Applying Stratosphere for Big Data Analytics
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper demonstrates the application of Stratosphe...
Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations
Big data processing has recently gained a lot of attention both from academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from various different sources and can arrive in the system at various rates. In order to proce...
Execution Primitives for Scalable Joins and Aggregations in Map Reduce
Analytics on Big Data is critical to derive business insights and drive innovation in today’s Internet companies. Such analytics involve complex computations on large datasets, and are typically performed on MapReduce based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex j...
Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions
The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...
Sensitivity studies of the recent new data on O(1D) quantum yields in O3 Hartley band photolysis in the stratosphere
The production yields of excited oxygen O(1D) atoms from the near ultraviolet O3 photolysis are essential quantities for atmospheric chemistry calculations because of their importance as major sources of hydroxyl (OH) radicals and nitric oxide (NO). Recently, new O(1D) quantum yields from O3 photolysis between 230 and 305 nm in the Hartley band region were reported, which are almost independent o...